AI Interpretability

Rigorously understanding how ML models function may allow us to identify and train against misalignment. Can we reverse engineer neural nets from their weights, or identify structures corresponding to “goals” or dangerous capabilities within a model and surgically alter them?

Mentors

Neel Nanda
Research Engineer, Google DeepMind

Neel leads the mechanistic interpretability team at Google DeepMind, focusing on reverse-engineering the algorithms learned by neural networks to differentiate between helpful and deceptively aligned models and better understand language model cognition.

  • Neel leads the Google DeepMind mechanistic interpretability team. He previously worked on mechanistic interpretability at Anthropic on the transformer circuits agenda and as an independent researcher on reverse-engineering grokking and making better tooling and educational materials for the field.

  • When training an ML model, we may know that it will learn an algorithm with good performance, but it can be very hard to tell which one. This is particularly concerning when "be genuinely helpful and aligned" and "deceive your operators by acting helpful and aligned, until you can decisively act to take power" look behaviorally similar. Mechanistic interpretability is the study of taking a trained neural network, and analysing the weights to reverse engineer the algorithms learned by the model. In contrast to more conventional approaches to interpretability, there is a strong focus on understanding model internals and what they represent, understanding the model’s “cognition”, and putting a high premium on deep and rigorous understanding even if this involves answering very narrow and specific questions. Better interpretability tools seem useful in many ways for alignment, but mechanistic approaches in particular may let us better distinguish deceptive models from aligned ones.

  • What am I looking for in an application?

    Some core skills in mech interp that I’ll be looking for signs of potential in. I’m excited both about candidates who are OK at all of them and about candidates who really shine at one:

    • Empirical Truth-Seeking: The core goal of interpretability is to form true beliefs about models. The main way to do this is by running experiments, visualising the results, and understanding their implications for what is true about the model.

      • You can show this with transparent reasoning about what you believe to be true and nuanced arguments for why.

    • Practicality: A willingness to get your hands dirty - writing code, running experiments, and playing with models. A common mistake in people new to the field is too great a focus on reading papers and thinking about things, rather than just doing stuff.

      • You can demonstrate this by just having a bunch of experiments that show interesting things!

    • Scepticism: Interpretability is hard and it is easy to trick yourself. A healthy scepticism must be applied to your results and how you interpret them, and often a well designed experiment can confirm or disprove your assumptions. It’s important to not overclaim!

      • Applications that make a small yet rigorous claim are much better than ones that make fascinating yet bold and wildly overconfident claims

      • You can show this with clear discussion of the limitations of your evidence and alternative hypotheses

    • Agency & Creativity: Being willing to just try a bunch of stuff, generate interesting experiment ideas, and be able to get yourself unstuck

      • You can show this if I read your application and think "wow, I didn't think of that experiment, but it's a good idea"

    • Intuitive reasoning: It helps a lot to have some intuitions for models - what they can and cannot do, how they might implement the behaviour

      • You can show this by discussing your hypotheses going into the investigation, and the motivation behind your experiments. Though "I just tried a bunch of shit to see what happened, and interpreted after the fact" is also a perfectly fine strategy

    • Enthusiasm & Curiosity: Mech interp can be hard, confusing and frustrating, or it can be fascinating, exciting and tantalising. How you feel about it is a big input into how good you are at the research and how much fun you have. A core research skill is following your curiosity (and learning the research taste to be curious about productive things!)

      • I know this is easy to fake and hard to judge from an application, so I don’t weight it highly here

    • I also want candidates who are able to present and explain their findings and thoughts clearly.

    • I’m aware that applicants will have very different levels of prior knowledge of mech interp, and will try to control for this.

    • A fuzzy and hard-to-define criterion is shared research taste - I want to mentor scholars who are excited about the same kinds of research questions that I am excited about! I recommend against trying to optimise for this, but mention it because I want to be transparent about this being a factor.

    What background do applicants need?

    • You don’t need prior knowledge of or research experience in mech interp, nor experience with ML, maths or research in general, though all of these help and are a plus!

      • I outline important pre-requisites here - you can learn some of these on the go, but each helps, especially a background in linear algebra, and experience coding.

    • In particular, a common misconception is that you need to be a strong mathematician - it certainly helps, but I’ve accepted scholars with weak maths backgrounds who’ve picked up enough to get by

    • Mech interp is a sufficiently young field that it just doesn’t take that long to learn enough to do original and useful research, especially with me to tell you what to prioritise!

    How can I tell if I’d be a good fit for this?

    • If you think you have the skills detailed above, that’s a very good sign!

    • More generally, finding the idea of mech interp exciting, and being very curious about the idea of what’s inside models - if you’ve read a mech interp paper and thought “this is awesome and I want to learn more about it” that’s a good sign!

    • The training phase of the program is fairly competitive, which some people find very stressful - generally, my impression is that participants are nice and cooperative, especially since you want to form teams, but ultimately less than half will make it through to the research phase, which sucks.

Adam Shai
Co-founder, Simplex

Adam Shai is a co-founder of Simplex, which leverages Computational Mechanics to analyze and predict the internal structures of language models, with the research aim of operationalizing advanced AI learning processes.

  • Adam Shai is the co-founder of Simplex, a nonprofit AI Safety organization focusing on using Computational Mechanics for AI Safety research. He is also a PIBBSS affiliate. Before his AI Safety work, he was an experimental and computational neuroscientist for more than a decade, doing mechanistic interpretability in rat brains at Stanford and Caltech.

  • What computational structure are we building into LLMs when we train them on next-token prediction? Using Computational Mechanics, we have a principled framework that allows us to make non-trivial predictions about the internal representations we expect to find in transformers; you can see our initial work at this LessWrong post, showing that we can find complicated fractal structures in the residual stream of transformers when trained on simple toy data structures. We are excited to extend this work in many directions.

    Now that we have a proof of principle that Computational Mechanics allows us to relate training data structure to both internal representations and behavior in a principled manner, I’m excited to push forward this research agenda. Some example projects I’d like to mentor are below:

    • Operationalizing in-context learning, transfer learning, abstraction, and generalization, and studying the behavioral and mechanistic underpinnings of these abilities as well as the conditions under which they occur

    • Applying computational mechanics to reinforcement learning agents

    • Benchmarking safety approaches like sparse autoencoders and weak-to-strong generalization

    • Using causal methods to validate the internal structures computational mechanics finds

    • Using mechanistic interpretability techniques to study predictive algorithms in transformers

    • Theoretical work about overlapping task and internal structures, compression of predictive structures, and agent foundations
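
    To give a concrete feel for the object Computational Mechanics predicts (this is an illustrative sketch added here, not Simplex’s code): an optimal next-token predictor must track the Bayesian belief state over the hidden states of the data-generating process, and the geometry of these belief states should be linearly recoverable from the residual stream. The toy HMM, the probe setup, and the synthetic stand-in activations below are all assumptions; a real experiment would cache activations from a trained transformer.

      import numpy as np

      rng = np.random.default_rng(0)

      # Toy hidden Markov process: 3 hidden states, 2 output symbols.
      T = np.array([[0.8, 0.1, 0.1],   # transition matrix P(next state | state)
                    [0.1, 0.8, 0.1],
                    [0.1, 0.1, 0.8]])
      E = np.array([[0.9, 0.1],        # emission matrix P(symbol | state)
                    [0.5, 0.5],
                    [0.1, 0.9]])

      def sample_sequence(length):
          s = rng.integers(3)
          xs = []
          for _ in range(length):
              xs.append(rng.choice(2, p=E[s]))
              s = rng.choice(3, p=T[s])
          return np.array(xs)

      def belief_states(xs):
          # Bayesian filtering: P(hidden state | symbols so far) after each symbol.
          b = np.ones(3) / 3
          out = []
          for x in xs:
              b = b * E[:, x]      # condition on the emitted symbol
              b /= b.sum()
              out.append(b.copy())
              b = b @ T            # propagate to the next hidden state
          return np.array(out)

      xs = sample_sequence(2000)
      beliefs = belief_states(xs)          # (2000, 3) points on the probability simplex

      # Stand-in for residual-stream activations at each position; a real experiment
      # would cache these from a trained transformer at some layer.
      acts = beliefs @ rng.normal(size=(3, 64)) + 0.05 * rng.normal(size=(2000, 64))

      # Linear probe: can the belief-state geometry be read off the activations?
      W, *_ = np.linalg.lstsq(acts, beliefs, rcond=None)
      resid = acts @ W - beliefs
      r2 = 1.0 - (resid ** 2).sum() / ((beliefs - beliefs.mean(0)) ** 2).sum()
      print(f"linear probe R^2: {r2:.3f}")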

  • Ideal candidates would have/be:

    • Software engineering skills (in python), especially with respect to data analysis and neural networks (e.g. pytorch)

    • Research experience - strong experimental design ability (e.g. red teaming results and ideas, running controls). A lot of research experience is not necessary but is helpful.

    • Strong working knowledge of linear algebra

    • Self-motivated and able to work productively on a project they own;

    • Interested in/open to building a long-term collaboration

    I am open to a small number of more theoretically minded mentees. These scholars should have a strong math and physics background.

    Mentorship for scholars will likely be:

    • 1 hour weekly meetings with each scholar individually

    • team meetings every 2 weeks

    • providing detailed feedback on write-up drafts

    • Fast and frequent Slack responses, typically ≤ 24 hours

Adrià Garriga Alonso
Research Scientist, FAR AI

Adrià is a Research Scientist at FAR AI focusing on advancing neural network interpretability and developing more rigorous methods for the field.

  • Adrià is a Research Scientist at FAR AI. His previous relevant work includes Automatic Circuit Discovery and Causal Scrubbing, two attempts at concretizing interpretability. He previously worked at Redwood Research on neural network interpretability. He holds a PhD from the University of Cambridge, where he worked on Bayesian neural networks with Prof. Carl Rasmussen.

  • His work has two main threads: 1) Make interpretability a more rigorous science. The goal here is to accelerate interpretability by putting everyone's findings on firmer footing, thus letting us move fast without fear of making things up. A long-term goal of this agenda is to build a functioning repository of circuits for some large model, which anyone can contribute to and which can build on previously-understood circuits.

    2) Understand how (and when) NNs learn to plan. The theory of change here is to be able to focus the planning of NNs solely into well-understood outlets (e.g. scratchpads). The long-term goal here is to have a 'probing' algorithm that, applied to a NN, yields its inner reward or finds that there isn't one.

    Scholars in this stream in MATS Winter 2024 worked on:

    • A benchmark with ground-truth circuits for evaluating interpretability hypotheses and automatic circuit discovery methods

    • Stronger ways of evaluating interpretability hypotheses (i.e. adversarial patching)

    • Replicating existing interpretability work in MAMBA.

  • N/a

Erik Jenner
PhD Student, CHAI

Erik Jenner is mentored by Stuart Russell and concentrates on mechanistic anomaly detection in AI systems. His work seeks to identify when AI models generate outputs for atypical reasons.

  • Erik is a PhD student at the Center for Human-Compatible AI at UC Berkeley, advised by Stuart Russell. He primarily works on mechanistic anomaly detection and sometimes also on interpretability and abstractions of neural networks. He has also published papers on reward learning, but note that he is not working on that topic at the moment. He has worked with around half a dozen mentees through MATS, CHAI internships, and otherwise.

  • MATS projects will most likely be on empirical work in mechanistic anomaly detection. Mechanistic anomaly detection aims to flag when an AI produces outputs for “unusual reasons.” This might help with some of the problems that mechanistic interpretability is trying to address (like deceptive alignment), but unlike interpretability, mechanistic anomaly detection doesn’t aim for human understanding, which might make it more tractable. I think there is a lot of shovel-ready empirical work to be done on mechanistic anomaly detection, complementing ARC’s more theoretically motivated approach.

    Projects might involve coming up with difficult and realistic benchmarks for mechanistic anomaly detection or developing better methods. This involves coding, training/finetuning neural networks, and conceptually thinking about how to design benchmarks that meaningfully measure progress, or which methods are worth exploring.

    See this post for a more detailed list of research directions in mechanistic anomaly detection. I’m also interested in some topics that aren’t technically mechanistic anomaly detection but are related in some ways, such as coup probes. I might also be open to projects closer to interpretability or abstractions of neural networks, but likely only on pretty specific topics.
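
    For a concrete sense of the shape such methods can take, here is a minimal sketch of a simple activation-based anomaly detection baseline (Mahalanobis distance from the distribution of activations on trusted data). This is an added illustration, not a method Erik specifically endorses; the random "activations" are stand-ins for activations cached from a real model.

      import numpy as np

      rng = np.random.default_rng(0)
      d = 64

      # Stand-ins for per-example activations from some layer of a model.
      trusted = rng.normal(size=(5000, d))        # data where the model behaves for "normal reasons"
      test = rng.normal(size=(100, d)) + 3.0      # possibly anomalous data

      mu = trusted.mean(axis=0)
      cov = np.cov(trusted, rowvar=False) + 1e-3 * np.eye(d)
      prec = np.linalg.inv(cov)

      def anomaly_score(x):
          # Squared Mahalanobis distance from the trusted activation distribution.
          diff = x - mu
          return diff @ prec @ diff

      threshold = np.quantile([anomaly_score(x) for x in trusted], 0.99)
      scores = np.array([anomaly_score(x) for x in test])
      print("flagged:", int((scores > threshold).sum()), "of", len(test))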

  • Things I’m looking for:

    • Strong programming skills (ideally Python);

    • Significant deep learning skills (ideally pytorch);

    • Signs of research skills (e.g. based on previous experience, blog posts or papers, answers to selection questions)

    Mentorship will look roughly as follows, with details depending on your preferences:

    • At least a 1h/week meeting with each scholar, happy to meet more as necessary;

    • I’m in favor of frequent Slack communication between meetings, but flexible;

    • Detailed feedback on write-up drafts and occasional code review if you like;

    • I’m excited about trying out other forms of mentorship, such as occasional pair programming, red teaming our theory of change together, etc.

    Note that I will be on vacation with sporadic internet starting Aug 11th, so I might have reduced availability at the end of MATS. I will mostly be in Berkeley before that.

Alex Turner
Research Scientist, Google DeepMind

Alex is conducting research on "steering vectors" to manipulate and control language models at runtime, alongside exploring mechanistic interpretability of AI behaviors in maze-solving contexts and the theory of value formation in AI.

  • Alex is a Research Scientist at Google DeepMind. He’s currently working on steering vectors for DeepMind internal models, i.e. algebraically modifying the runtime properties of language models, and on mechanistic interpretability of maze-solving agents (work done with the MATS Winter 2022 Cohort’s team shard, mechanistic interpretability subteam). He also sometimes does theoretical work on the shard theory of value formation. In the past, he formulated and proved the power-seeking theorems, and proposed the Attainable Utility Preservation approach to penalizing negative side effects.

    Highlight outputs from past streams:

  • Mechanistic interpretability

    Shard theory aims to predict how tweaks to learning processes (eg changes to the reward function) affect policy generalization (e.g. whether the AI prioritizes teamwork or personal power-seeking in a given situation). Alex would like to supervise projects that mechanistically interpret and control existing networks.

    Steering vectors

    Why and how do steering vectors work so well? Is there a “science of steering” we can discover, or at least some statistical regularities in when and how they do or don’t work? E.g. what should go into creating the vectors, what governs the intervention strength required, etc. This probably looks like careful ablations and automated iteration over design choices.

    Discovering qualitatively new techniques

    Steering GPT-2-XL by adding an activation vector opened up a new way to cheaply steer LLMs at runtime. Subsequent work has reinforced the promise of this technique, and steering vectors have become a small research subfield in their own right. Unsupervised discovery of model behaviors may now be possible thanks to Andrew Mack’s method for unsupervised steering vector discovery. What other subfields can we find together?
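
    For readers new to the technique, here is a minimal sketch of activation addition in the spirit of the Steering GPT-2-XL work, reimplemented with plain HuggingFace hooks rather than the authors' code; the layer index, coefficient, and contrast prompts are illustrative assumptions, not tuned choices.

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tok = AutoTokenizer.from_pretrained("gpt2")
      model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
      LAYER, COEFF = 6, 4.0   # illustrative choices, not tuned

      def resid_at_layer(prompt):
          # Residual-stream activations at layer LAYER for each prompt position.
          ids = tok(prompt, return_tensors="pt").input_ids
          with torch.no_grad():
              hs = model(ids, output_hidden_states=True).hidden_states
          return hs[LAYER][0]                     # (seq, d_model)

      # Steering vector: difference of activations on a contrastive prompt pair,
      # taken at the final token position.
      steer = resid_at_layer(" Love")[-1] - resid_at_layer(" Hate")[-1]

      def steering_hook(module, inputs, output):
          # GPT-2 blocks return a tuple; output[0] holds the hidden states.
          return (output[0] + COEFF * steer,) + output[1:]

      handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
      ids = tok("I think dogs are", return_tensors="pt").input_ids
      out = model.generate(ids, max_new_tokens=20, do_sample=False)
      handle.remove()
      print(tok.decode(out[0]))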

  • Ideal candidates would have:

    • Academic background in machine learning, computer science, statistics, or a related quantitative field.

    • Familiarity with ML engineering.

    • Proven experience working on machine learning projects, either academically or professionally.

    • Strong programming skills, preferably in Python, and proficiency in data manipulation and analysis.

    • Ability to write up results into a paper.

    Mentorship looks like:

    • Weekly meetings (1-on-1 and group)

    • Slack presence

    • Limited support otherwise (per Google DeepMind rules, I can’t contribute code to the projects)

Arthur Conmy
Research Engineer, Google DeepMind

Arthur’s research focuses on discovering and developing novel methods for automating interpretability and on applying model internals to critical safety tasks.

  • Arthur Conmy is a Research Engineer at Google DeepMind, on the Language Model Interpretability team with Neel Nanda. His interests are in automating interpretability, finding circuits and making model internals techniques useful for AI Safety. Previously, he worked at Redwood Research (and did the MATS Program!).

  • I’m most interested in supervising projects that propose original ways to scale interpretability, and/or show that model internals techniques are helpful for safety-relevant tasks (e.g. jailbreaks, sycophancy, hallucinations). I work with Sparse Autoencoders a lot currently, but it’s not a requirement to work on these if you work with me.

  • I’m quite empirically-minded and prefer discussing experiments and implementation to theory or philosophy. I think this leads to stronger outputs from projects, but if you would prefer more theoretical research other mentors may be a better fit. We would likely meet once a week for over an hour, and then more frequently when you have a blog post or paper to put out.
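
    For context on the sparse autoencoders Arthur mentions above, here is a minimal training-loop sketch; the dictionary size, L1 coefficient, and random stand-in activations are illustrative assumptions rather than recommended settings.

      import torch
      import torch.nn as nn

      class SparseAutoencoder(nn.Module):
          def __init__(self, d_model, d_dict):
              super().__init__()
              self.enc = nn.Linear(d_model, d_dict)
              self.dec = nn.Linear(d_dict, d_model)

          def forward(self, x):
              feats = torch.relu(self.enc(x))     # sparse feature activations
              return self.dec(feats), feats

      d_model, d_dict, l1_coeff = 512, 4096, 1e-3
      sae = SparseAutoencoder(d_model, d_dict)
      opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

      # Stand-in for a batch of residual-stream activations cached from a model.
      acts = torch.randn(1024, d_model)

      for step in range(100):
          recon, feats = sae(acts)
          # Reconstruction loss plus an L1 sparsity penalty on the features.
          loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().sum(-1).mean()
          opt.zero_grad()
          loss.backward()
          opt.step()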

Jesse Hoogland
Founder, Timaeus

Jesse Hoogland spearheads research in singular learning theory and developmental interpretability, focusing on understanding and managing phase transitions in neural networks, with the aim of preemptively addressing the emergence of hazardous AI capabilities.

  • Jesse Hoogland is Co-Founder and Executive Director of Timaeus, an AI safety organization currently working on singular learning theory and "developmental interpretability," a new research agenda that aims to detect, locate, and interpret phase transitions in neural networks. He was previously a research assistant at David Krueger’s lab at Cambridge University and a MATS Research Scholar.

  • Background

    Timaeus concentrates on two primary avenues of research: singular learning theory and developmental interpretability.

    Singular Learning Theory (SLT) is a mathematical theory of neural networks and other hierarchical learning machines established by Sumio Watanabe. This has led to concepts like the learning coefficient and insights into phase transitions in Bayesian learning. Further advances in singular learning theory look set to yield new tools for alignment.

    Developmental Interpretability (“devinterp”) is an approach to understanding the emergence of structure in neural networks, which is informed by singular learning theory but also draws on mechanistic interpretability and ideas from statistical physics and developmental biology. The key idea is that phase transitions organize learning and that detecting, locating, and understanding these transitions could pave a road to evaluation tools that prevent the development of dangerous capabilities, values, and behaviors.

    For an overview of how these fields fit together, see this and this talk from the 2024 Tokyo AI Safety conference. Here, Jesse and Stan van Wingerden sketch a vision for how singular learning theory and developmental interpretability could contribute to the alignment portfolio.

    Projects during MATS

    The research projects that we will be running alongside MATS have not yet been written up. They will be shared with selected candidates.
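
    As background on the central quantity mentioned above, here is a rough sketch (my own simplification, not Timaeus code) of an SGLD-based estimator of the local learning coefficient, using the form λ̂ = nβ(E[L_n(w)] − L_n(w*)) with β = 1/log n, where the expectation is over samples drawn near a trained optimum w*. The model, data, and all hyperparameters are toy assumptions.

      import copy
      import torch
      import torch.nn as nn

      torch.manual_seed(0)
      X, y = torch.randn(512, 4), torch.randint(0, 2, (512,))
      model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
      loss_fn = nn.CrossEntropyLoss()

      # Pretend `model` has been trained to a local optimum w*.
      opt = torch.optim.Adam(model.parameters(), lr=1e-2)
      for _ in range(500):
          opt.zero_grad(); loss_fn(model(X), y).backward(); opt.step()

      n = len(X)
      beta = 1.0 / torch.log(torch.tensor(float(n)))
      eps, gamma, steps = 1e-4, 100.0, 2000

      w_star = [p.detach().clone() for p in model.parameters()]
      L_star = loss_fn(model(X), y).item()

      sampler = copy.deepcopy(model)
      losses = []
      for step in range(steps):
          idx = torch.randint(0, n, (64,))
          loss = loss_fn(sampler(X[idx]), y[idx])
          grads = torch.autograd.grad(loss, list(sampler.parameters()))
          with torch.no_grad():
              for p, g, p0 in zip(sampler.parameters(), grads, w_star):
                  # SGLD step, localized around w* by a quadratic restraining term.
                  p -= eps * ((beta * n / 2) * g + (gamma / 2) * (p - p0))
                  p += torch.randn_like(p) * (eps ** 0.5)
              losses.append(loss_fn(sampler(X), y).item())

      llc_hat = n * beta.item() * (sum(losses[steps // 2:]) / (steps // 2) - L_star)
      print(f"estimated local learning coefficient: {llc_hat:.2f}")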

  • Research

    We recommend checking out some of the publications, blog posts, or lectures Timaeus has put out to determine if this research area is something for you.

    SLT and devinterp are a natural fit for people with a background in mathematics (especially algebraic geometry, empirical process theory, dynamical systems, and learning theory), physics (especially statistical physics, catastrophe theory, and condensed matter), programming language theory, and developmental biology or neuroscience.

    Having a background in one of these fields is, however, not a prerequisite to apply. We also don’t expect you to necessarily be familiar with SLT. If anything, we’re prioritizing applicants with more of a software development background who are familiar with PyTorch and able to get ML experiments up and running quickly on their own.

    Mentorship

    In terms of mentorship, this will be done remotely via the devinterp discord. You’ll meet regularly (one-on-one) with Jesse and the other people that are in the project you’re working on.

Jason Gross
Technical Lead, Special Projects, ARC Theory

Jason Gross’s research aims to formalize mechanistic explanations of neural-net behavior to ensure reliable AI operation, using formal proofs and automated evaluations to enhance the transparency and accountability of AI systems.

  • Jason is a programming languages scientist. Previously, he worked on securing the internet and fighting with proof-assistants. Now, he leads a team working on formalizing mechanistic explanations of neural-net behavior. He is interested in both theory-heavy projects geared at guarantees and automated evaluation, and applied projects that use guarantees-compression to play with and understand models.

  • Summary of the proofs approach:

    • Motivation: To really discover what’s going on in large systems with emergent capabilities, we will develop automatically-generated explanations, including by delegating to AI systems. To make use of the automatically-generated explanations, we will need automatic evaluations/checking that cannot be Goodharted.

    • Premise: Explanations have two important properties: correspondence to the underlying system, and compression of the underlying system. Given a formal specification, correspondence is easy to achieve; however, without adequate compression we may never finish checking the explanation.

    • Question: How much can we compress the sprawling small noise in current model architectures? Empirically, noise seems unimpactful. But if we can’t compress it, maybe there’s an adversary hiding.

    • Perspective: Can we use compression-based metrics to develop feedback loops for SOTA approaches like SAEs? The success of a technique is what it can explain, and we have a repertoire of tools for analyzing precisely what has been explained.

    Researchers are driven by different goals day-to-day: from seeking fundamental understanding of how learning happens, to fixating on the few test cases that a NN fails to find good solutions to. Projects go best when you find something that you are inspired to own. If you’re enthusiastic about the proofs approach, I am happy to work with you to find a project that will set you up to succeed.

    Some possible projects:

    • Train sparse autoencoders (SAEs) on a transformer trained on the algorithmic max-of-K task (see here or here) exploring the hypothesis that SAEs can give optimal case analysis structure for proofs. Questions:

      • Does using a combinatorial/multiplicative sparsity penalty rather than an additive one give better results?

      • Where in the network do we want to insert SAEs to recover case analysis structure?

      • Would it be better to train SAEs that use rectified linear step functions rather than rectified linear unit functions (ReLU) to capture features that are not binary (on-off), such as the size direction in max-of-K?

      • (More open-ended) what does the proofs frame have to say about SAEs, and what can we learn about compact guarantees by playing with SAEs?

    • Toy models of structure incrementalism: Vertebrates have a blind spot because the optic nerve punches through the retina before attaching from the front. Evolution has not solved this problem because every incremental step needs to make the organism more fit, and discontinuous jumps won’t happen. Similarly, complex structures won’t arise except as a series of incrementally useful steps. Induction heads should form a toy model of this, where the absence of useful trigram statistics in the training data results in the inability to form the first layer of an induction head. Questions:

      • Are induced trigram statistics (from bigram statistics / markov chain generators) sufficient, or do the trigram statistics need to have predictive power even after screening off the value of the current token?

      • Is there a relationship between the strength or usefulness of trigram statistics and the speed of induction head formation?

        • We need to somehow measure speed of induction head formation

        • We need to measure strength of trigram statistics

    • Developing techniques to compress noise on more models (each training objective or model config can be a single “project” — formal proofs are hard!):

      • Questions:

        • What are the challenges introduced by each new kind of operation (multiple heads, multiple layers, layernorm)?

        • Is there a uniformity to the kind of model-specific insight required for noise-compression to various asymptotic bounds, even across models?

    • Develop expository tutorials for the proofs approach to mech interp, for instance a Sorted List (1L, 2H, LN, attn-only) attempt for the ARENA mech interp monthly challenges.

    • Get LLMs to formalize the proofs in proof assistants like Lean or Coq.

    My projects typically involve a combination of tasks:

    • Training toy models with PyTorch, sometimes including thinking about and adjusting the data generating procedure or hyperparameters

    • Exploring small trained models

    • Constructing and evaluating hypotheses for mechanistic explanations

    • Writing programs to bound the worst-case divergence of a concrete model from an estimate computed according to a mechanistic explanation

      • Iterating on the programs to produce non-vacuous bounds

      • Analyzing the computational complexity of the worst-case divergence computation

    • Writing up proof sketches or formal proofs that a given bound computation is correct
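
    As a minimal illustration of what the proofs approach is trying to improve on, the sketch below brute-forces a worst-case accuracy "guarantee" for a stand-in max-of-K model by enumerating every input: it achieves correspondence but zero compression, which is exactly the cost a mechanistic proof aims to avoid. The model here is a random stand-in, not a trained transformer.

      import itertools
      import numpy as np

      rng = np.random.default_rng(0)
      d_vocab, n_ctx = 10, 4
      W = rng.normal(size=(d_vocab, d_vocab))   # stand-in "model" parameters

      def model(tokens):
          # Stand-in model: sum of one-hot tokens through a random matrix -> logits.
          onehot = np.eye(d_vocab)[list(tokens)].sum(axis=0)
          return onehot @ W

      # Brute-force "guarantee": check every one of d_vocab ** n_ctx inputs.
      failures = sum(
          int(np.argmax(model(tokens)) != max(tokens))
          for tokens in itertools.product(range(d_vocab), repeat=n_ctx)
      )
      total = d_vocab ** n_ctx
      print(f"worst-case accuracy bound: {(total - failures) / total:.3f} over {total} inputs")
      # A mechanistic proof would instead bound the error via a case analysis whose
      # cost grows far more slowly than d_vocab ** n_ctx.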

  • In addition to the skills and interests you would need for mech interp generally, the following are important for enjoying working on the proofs approach:

    • You like at least one of the following: complexity theory, analysis, number theory, formal logic, or bashing inequalities. (And if you don’t, you have a long explanation for it.)

    • You have developed mathematical fluency via research, competition math, or a rigorous undergrad program.

    • You have patience to wrestle with code. You don’t have to be a top engineer, but you will need the skills that are picked up by contributing to a large software project with many moving parts, or working through the tedious parts of ARENA.

    My mom is a pre-K teacher—most of her job is to empower students and nurture their vital curiosity. This is also my mentorship philosophy, and I am excited to work with researchers who are both opinionated and actively seeking feedback through action.

    I’ve had success with the following structure for previous fellowship cohorts:

    • You own a clear, well-defined (component of a) project.

    • You have scheduled co-working blocks or pairing sessions with collaborators.

    • Biweekly group meeting where everybody presents interim results.

    • Weekly 1:1 time of 1h (+2h as needed) to discuss research plan, stare at code, brainstorm, provide feedback on write ups, or provide other general support.

    • Variable availability via slack.

Jake Mendel
Research Scientist, Apollo Research

Jake Mendel is pioneering methods for approaching AI interpretability as a fundamental science, focusing on developing solid theoretical foundations and empirical techniques to dissect and improve AI model internals.

  • Jake is a Research Scientist at Apollo Research. He has recently published work on how neural networks might perform computation entirely in superposition, and on testing predictions of singular learning theory in the toy model of superposition. His current research interests center around trying to approach interpretability as a fundamental science, targeting a level of understanding that can scale to superhuman systems when more prosaic techniques fail.

  • I'm excited about projects that work towards a science of interpretability based on proper understanding and some degree of rigour, that engage with the massive possibility space of how model internals could work and what good interpretability might look like, and that are not likely to funge readily with the next person's ideas. This could look like projects which attempt to challenge flimsy assumptions and paradigms in the field, and which work towards more solid theory for reverse engineering. Alternatively, it could look like approaches which take a bird's eye view of the field, and which tackle neglected parts of the path from where we are today to somewhere that could help with safety, even if this doesn't interface very heavily with mainstream mech interp work.

    As with most mentors (I would assume), I am most excited about new projects that mentees propose themselves, but as a reference class I'll list some work that I like or would be excited to work on or mentor:

    Examples of relatively different ideas in mech interp that feel like they could be on the path to safe AI but which seem neglected by the field:

    • Developing and testing ways to identify the natural decomposition of language modeling into subtasks that respects the model's internals. I discuss the motivation behind this proposal and concrete ideas in much more detail in this post.

    • Designing models which are more interpretable by design, but which we do not try to make competitive.

    • Developing techniques to separate models into the parts that have just memorized lots of shallow facts and heuristics, and the parts that have more interesting computational structure.

    • Attempts to interface more directly with providing empirical feedback to questions in agent foundations. For example, focussing on understanding learned algorithms for learning, such as in-context gradient descent. Can we reverse engineer algorithms for learning in context, or study how the ability to learn in context changes as models scale?

    Examples of theoretical foundations for interpretability:

    • Understanding computation in superposition. My current work to extend this involves:

      • Training neural networks to compress sparse boolean circuits and then attempting to reverse engineer them.

      • Thinking about error propagation and how you could use analysis of errors to identify features

      • Improving SAEs by changing their architecture substantially to respect the computational structure more, as briefly mentioned in Anthropic's summary of our work

      • Looking for more sophisticated techniques for taking features out of superposition by factoring in information about the computation done on features by the model, for example by also attempting to sparsify the gradients to the activations.

    • Building the right language for understanding model internals and critiquing existing work

    • Understanding and testing the linear representation hypothesis. Can we find compelling examples of nonlinear features?

    • Studying several of these questions and more by carefully and thoroughly analyzing how small boolean circuits are learned in toy models. Are features always linear? Are they spread over multiple layers? Is the network modular in any sense?

    • Work which helps us work out if our interpretations are valid (eg. causal scrubbing and other work with similar goals) and work which attempts to build good baselines and benchmarks

    • Attempts to redteam SAEs and the superposition hypothesis, via analysis of SAE reconstruction errors; testing SAEs on models where we have access to the ground truth, and trying to make these toy models as similar to the real cases as we can; thoughtful conceptual critiques; building better baselines to compare SAEs to, eg random vectors with the same covariance structure as the dataset, comparing to human interpretability scores on standard clusterings of activation vectors; developing techniques for identifying dense or nonlinear features that are being misinterpreted by SAEs

    • New ideas for how to test if you have really found the right features, to supplement and potentially supersede metrics based on human interpretability
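
    As a point of reference for the superposition and SAE-red-teaming directions above, here is a minimal sketch of the standard toy-model-of-superposition setup (sparse features squeezed through a low-dimensional bottleneck with a tied-weight linear map); the dimensions and sparsity level are arbitrary toy choices, not settings anyone here recommends.

      import torch
      import torch.nn as nn

      torch.manual_seed(0)
      n_features, d_hidden, sparsity = 20, 5, 0.95

      W = nn.Parameter(torch.randn(n_features, d_hidden) * 0.1)
      b = nn.Parameter(torch.zeros(n_features))
      opt = torch.optim.Adam([W, b], lr=1e-2)

      for step in range(2000):
          # Sparse features: each is active with probability (1 - sparsity).
          x = torch.rand(256, n_features) * (torch.rand(256, n_features) > sparsity).float()
          recon = torch.relu(x @ W @ W.T + b)     # compress to d_hidden dims, decode back
          loss = ((recon - x) ** 2).mean()
          opt.zero_grad(); loss.backward(); opt.step()

      # Which features get near-orthogonal directions, and which end up sharing
      # directions in superposition? Inspect the Gram matrix of feature directions.
      print((W @ W.T).detach().round(decimals=2))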

  • I’m looking for people with at least one of:

    • Strong quantitative skills

      • Do you have experience solving problems in a field like maths/theoretical physics/computer science?

      • At least confidence with linear algebra, calculus and probability theory. Deeper knowledge in relevant subfields like compressed sensing/boolean algebra/statistical physics would be great but not required

    • Strong ML engineering skills

      • Do you have experience training or interpreting real models?

      • Experience with python and relevant packages like pytorch

    In addition, we might work well together if you:

    • Develop your own project proposals and get excited about them.

    • Are truth seeking and attempt to be well calibrated about your hypotheses, engaging with the high levels of uncertainty that face all interpretability research

    • Have well formed views about the alignment problem, and about how interpretability may or may not be able to help

    Mentorship could take a range of shapes. At minimum, I aim to meet mentees for an hour a week, plus regular slack messages (>48 hours to reply) and comments on google docs, although I may end up offering to meet more than once a week.

Lucius Bushnaq
Interpretability Team Lead, Apollo Research

Lucius Bushnaq focuses on developing principled methods to decompose neural networks and discover parameterization-invariant representations through singular learning theory.

  • Lucius Bushnaq leads the interpretability team at Apollo Research. Their main research interests include finding principled ways to decompose neural networks into parts and discovering parametrization-invariant representations of neural networks using singular learning theory.

  • I believe that the Science of Deep Learning, particularly mechanistic interpretability, must reach a much more advanced state before most of the alignment research we need can even get properly started. Predictions are hard, especially about the future. But it seems to me that in many possible futures, we need to understand neural networks much better to get a good outcome.

    For example, I think making progress on critical questions in agent foundations, like what the type signature of 'goals' in real agent internals is, and how you make them robustly match specific goals you want, might first require understanding what 'abstraction-based reasoning' is. If we could completely reverse engineer an LLM, we might be able to understand how it performs abstract reasoning. Once we understand what an LLM does in a single forward pass, we might be in a position to start research on more advanced topics, like what a ‘goal’ is.

    Even if there’s no time left to do such alignment research, I think projects aiming for near-term defence against disaster benefit greatly from every additional insight into the functionality of neural networks. A deeper understanding of mechanistic interpretability is an aid in developing better techniques for detecting deception in models. Maybe even more crucially, it might let us know in advance and in detail how these techniques might fail. Debates about whether models are meaningfully aligned by interventions like RLHF would be easier to decisively settle in a way that is verifiable to people outside the field if we understood model internals well enough to directly observe and understand the effects of RLHF training.

    What kind of research projects am I interested in mentoring?

    First and foremost, I’m quite excited about people developing their own ideas.

    To give an impression of the flavour of research I’d most like to facilitate, here are some examples: 1, 2, 3, 4, 5, 6, 7.

    As a potential starting point, below is a list of some concrete research projects I’d be excited for someone to pick up. Most of them are about decomposing neural networks into elementary units of some kind, either in a principled way that matches the causal structure of the algorithms they implement or in a way that just seems practically useful for interpretability. A few of them instead investigate questions in singular learning theory that I think might be highly relevant for figuring out such a decomposition.

    In addition to these concrete projects, I’m also interested in attempts to characterise the geometry of feature directions SAEs learn. This would help us understand whether SAEs are giving sensible outputs that match what our knowledge of sparse coding would predict, and potentially yield additional metrics to judge SAE quality.

    • Finding parametrization-invariant representations of neural networks: Singular Learning Theory tells us that neural networks that generalise well will have many parameter choices that implement (almost) the same behaviour. So when we try to reverse engineer the algorithm a neural network is implementing by looking at its weights, some of what we see is ‘phantom’ structure that comes from the particular parameter choices this network happens to use, not from the causal structure of the algorithm we are trying to reverse engineer. This suggests that we might like to find a parametrization-invariant representation of neural network internals: a simpler representation that reflects the causal structure of the learned algorithm. We’ve made some progress on this at Apollo, but I’d be excited for people to try more in this direction.

    • Decomposing computational operations in neural networks: A lot of mechanistic interpretability work attempts to understand and reverse engineer networks by 1) Decomposing activation vectors into elementary basis directions, based on criteria like the basis being compressed or sparsely activating. 2) Understanding the interactions between those basis elements in different layers. Instead, one could try to directly decompose the interactions between adjacent layers into elementary basis operations, using tools like transcoders or tensor decompositions.

    • Testing the feature locality assumption: Most mechanistic interpretability assumes that ‘features’ in the network are located in a single layer. But features could theoretically live spread out over multiple layers, perhaps especially in architectures with skip connections, such as transformers. I’d be interested in people investigating how one could coherently test this locality assumption, or look for features spread out over multiple layers in a coherent way. A starting point for this might be to investigate to what extent the eigenvectors of the network’s Hessian can be localised to a single layer, connecting to prior work like this.

    • Quantifying the influence of terms above leading order on the learning coefficient: The learning coefficient is a quantity in singular learning theory that quantifies how simple the algorithm a network implements is. The learning coefficient can be calculated using a power series expansion of the network’s loss function around a point in the loss landscape. Theoretically, every term in this power series matters in determining the value of the learning coefficient. Practically, there’s still substantial debate on whether terms above leading order in this power series actually make a meaningful difference for the sort of networks we typically find in the real world. A project could try to calculate the leading and next-to-leading order contributions to the learning coefficient in some actual models like CIFAR MLPs and see how big the difference is. This could inform work on measuring the complexity of models in practice and work that tries to understand how the network’s internal structure relates to the learning coefficient.

    • Decomposing neural network layers into directions that influence the output independently: One way one could try to find linear ‘features’ in a neural network is by looking for characteristic directions in the layer’s activations that influence the final output in uncorrelated ways. Based on this hypothesis, I think one might be able to derive an optimisation problem involving a four-tensor in a layer’s activations and gradients, such that solutions to the optimisation problem yield the feature directions of the layer. A project on this might try to test the idea in small networks or check whether existing tools like Sparse Autoencoders find features that match this hypothesis.
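
    As a small, concrete starting point for the feature-locality project above, here is a minimal sketch of the Hessian-eigenvector check on a toy MLP: compute the loss Hessian, take a top eigenvector, and measure how its mass is distributed across layers. The model, data, and parameter-tensor granularity are illustrative assumptions.

      import torch
      import torch.nn as nn

      torch.manual_seed(0)
      model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
      X, y = torch.randn(64, 4), torch.randint(0, 2, (64,))
      loss_fn = nn.CrossEntropyLoss()

      params = list(model.parameters())
      sizes = [p.numel() for p in params]

      def loss_of(flat):
          # Rebuild the parameters from a flat vector and evaluate the loss functionally.
          chunks = torch.split(flat, sizes)
          out, idx = X, 0
          for layer in model:
              if isinstance(layer, nn.Linear):
                  w = chunks[idx].view_as(layer.weight)
                  b = chunks[idx + 1].view_as(layer.bias)
                  out = out @ w.T + b
                  idx += 2
              else:
                  out = layer(out)
          return loss_fn(out, y)

      flat = torch.cat([p.detach().flatten() for p in params])
      H = torch.autograd.functional.hessian(loss_of, flat)    # (n_params, n_params)
      eigvals, eigvecs = torch.linalg.eigh(H)
      top = eigvecs[:, -1]                                    # eigenvector of the largest eigenvalue

      # Fraction of the (unit-norm) eigenvector's squared mass in each parameter tensor.
      for (name, _), chunk in zip(model.named_parameters(), torch.split(top, sizes)):
          print(name, f"{(chunk ** 2).sum().item():.3f}")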

  • By default, mentorship looks like 1h of weekly meetings with Slack messages in between, possibly more if something urgent arises. Jake Mendel’s and my own streams will be closely linked, with identical application questions, since we have similar research interests and work in the same team. Lee Sharkey is also on our interpretability team, so collaboration on projects with his scholars might also be possible. Dan Braun and Stefan Heimersheim of our interpretability team have agreed to help support our mentees with more coding-focused aspects of projects when they have the time.

    Some ability and interest in math or theory research are required for mentees to be a good fit, and those are also the skills I feel best placed to mentor people in. However, if you’re working in a team with another scholar, one person doing the math and another person doing the coding also works.

    I’m most interested in mentoring the sort of person who wants to come up with their own research ideas and agendas related to interpretability, neural network science or agent foundations. Aside from just having a research project proposal in the application, indicators I might look for to tell if a scholar is a good fit might be:

    • Does a scholar have a model of the alignment problem, their chosen sub-problem(s), and how working on the latter helps with the former.

    • Past alignment projects a scholar has worked on.

    • Evidence of ability to carry out independent research projects

    • Evidence of general mathematical knowledge

    • Evidence of general machine learning knowledge

    • Evidence of general ability at rational thinking

    • Knowledge of singular learning theory and sparse coding is a bonus

Lee Sharkey
Chief Strategy Officer, Apollo Research

Lee is Chief Strategy Officer at Apollo Research. His main research interests are mechanistic interpretability and “inner alignment.”

  • Lee Sharkey is Chief Strategy Officer at Apollo Research. Previously, Lee was a Research Engineer at Conjecture, where he recently published an interim report on superposition. His main research interests are mechanistic interpretability and “inner alignment.” Lee’s past research includes “Goal Misgeneralization in Deep Reinforcement Learning” and “Circumventing interpretability: How to defeat mind-readers.”

  • Understanding what AI systems are thinking seems important for ensuring their safety, especially as they become more capable. For some dangerous capabilities like deception, it’s likely one of the few ways we can get safety assurances in high stakes settings.

    Safety motivations aside, capable models are fundamentally extremely interesting objects of study and doing digital neuroscience on them is comparatively much easier than studying biological neural systems. Also, being a relatively new subfield, there is a tonne of low hanging fruit in interpretability ripe for the picking.

    I think mentorship works best when the mentee is driven to pursue their project; this often (but not always) means they have chosen their own research direction. As part of the application to this stream, I ask prospective mentees to write a project proposal, which forms the basis of part of the selection process. If chosen, depending on the research project, other Apollo Research staff may offer mentorship support.

    What kinds of research projects am I interested in mentoring?

    Until recently, I have primarily been interested in 'fundamental' interpretability research. But given recent fundamental progress, particularly from Cunningham et al. (2023), Bricken et al. (2023), and other upcoming work (including from other scholars in previous cohorts!), I'm now equally open to supervising applied interpretability work on networks of practical importance, particularly work that uses sparse dictionary learning as a basic interpretability method.

    Here is a list of example project ideas that I'm interested in supervising, which span applied, fundamental, and philosophical interpretability questions. These project ideas are only examples (though I'd be excited if mentees were to choose one of them). If your interpretability project ideas are not in this list, there is still a very good chance I am interested in supervising them:

    • Examples of applied interpretability questions I'm interested in:

      • What do the sparse dictionary features mean in audio or other multimodal models? Can we find some of the first examples of circuits in audio/other multimodal models? (see Reid (2023) for some initial work in this direction)

      • Apply sparse dictionary learning to a vision network, potentially a convolutional network such as AlexNet or Inceptionv1, thus helping to complete the project initiated by the Distill thread that worked toward completely understanding one seminal network in very fine detail.

      • Can we automate the discovery of "finite state automata"-like assemblies of features, which partly describe the computational processes implemented in transformers, as introduced in Bricken et al. (2023).

    • Examples of fundamental questions I'm interested in:

      • How do we ensure that sparse dictionary features actually are used by the network, rather than simply being recoverable by sparse dictionary learning? In other words, how can we identify whether sparse dictionary features are functionally relevant?

      • Gated Linear Units (GLUs) (Shazeer, 2020), such as SwiGLU layers or bilinear layers, are a kind of MLP that is used in many public models (e.g. Llama2, which uses SwiGLU MLPs) and likely non-public frontier models (such as PaLM2, which also uses SwiGLU). How do they transform sparse dictionary elements? Bilinear layers are an instance of GLUs that have an analytical expression, which makes them attractive candidates for studying how sparse dictionary elements are transformed in nonlinear computations in GLUs (a minimal sketch of a bilinear layer appears after this list of examples).

      • Furthermore, there exists an analytical expression for transformers that use bilinear MLPs (with no layer norm) (Sharkey, 2023). Can we train a small bilinear transformer on either a toy or real dataset, perform sparse dictionary learning on its activations, and understand the role of each sparse dictionary feature in terms of the closed form solution? This may help in identifying fundamental structures within transformers in a similar way that induction heads were discovered.

      • RMSNorm is a competitive method of normalizing activations in neural networks. It is also more intuitive to understand than layer norm. Studying toy models that use it (such as transformers that use only attention and RMS norm as their nonlinearities) seems like a good first step to understanding their role in larger models that use it (such as Llama2). What tasks can such toy transformers solve and how do they achieve it?

      • I'm also open to supervising singular learning theory (SLT)-related projects but claim no expertise in SLT. Signing up with me for such projects would be high risk. So I’m slightly less likely to supervise you if you propose to do such a project, unless the project feels within reach for me. I'd be open to exploring options for a relatively hands off mentorship if a potential mentee was interested in pursuing such a project and couldn't find a more suitable mentor.

    • Examples of philosophical interpretability questions I'm interested in:

      • What is a feature? What terms should we really be using here? What assumptions do these concepts make and where does it lead when we take these assumptions to their natural conclusions? What is the relationship between the network’s ontology, the data-generating ontology, sparse dictionary learning, and superposition?

    Again, these project ideas are only examples. I’m interested in supervising a broad range of projects and encourage applicants to devise their own if they are inclined. (I predict that devising your own will have a neutral effect on your chances of acceptance in expectation: It will positively affect your chances in that I’m most excited by individuals who can generate good research directions and carry them out independently. But it will negatively affect your chances in that I expect most people are worse than I am at devising research directions that I in particular am interested in! Overall, I think the balance is probably neutral.)
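
    Here is the minimal bilinear-layer sketch referenced in the GLU question above: because a bilinear MLP layer is quadratic in its input, it can be written exactly as a third-order interaction tensor, which is what makes it attractive for studying how sparse dictionary elements are transformed. The sizes are arbitrary toy choices; this is an added illustration, not code from Lee's work.

      import torch

      torch.manual_seed(0)
      d_model, d_hidden = 8, 16
      W1 = torch.randn(d_hidden, d_model)
      W2 = torch.randn(d_hidden, d_model)
      Wd = torch.randn(d_model, d_hidden)

      def bilinear_layer(x):
          # A GLU with no elementwise nonlinearity: out = W_d ((W_1 x) * (W_2 x)).
          return Wd @ ((W1 @ x) * (W2 @ x))

      # Equivalent interaction tensor B[o, i, j]: out_o = sum_{i,j} B[o, i, j] x_i x_j.
      B = torch.einsum("oh,hi,hj->oij", Wd, W1, W2)

      x = torch.randn(d_model)
      out_layer = bilinear_layer(x)
      out_tensor = torch.einsum("oij,i,j->o", B, x, x)
      print(torch.allclose(out_layer, out_tensor, atol=1e-4))  # True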

  • As an indicative guide (this is not a score sheet), in no particular order, I evaluate candidates according to:

    • Science background

      • What indicators are there that the candidate can think scientifically, can run their own research projects or help others effectively in theirs?

    • Quantitative skills

      • How likely is the candidate to have a good grasp of mathematics that is relevant for interpretability research?

      • Note that this may include advanced math topics that have not yet been widely used in interpretability research but have potential to be.

    • Engineering skills

      • How strong is the candidate’s engineering background? Have they worked enough with Python and the relevant libraries (e.g. PyTorch, scikit-learn)?

    • Other interpretability prerequisites

    • Safety research interests

      • How deeply has the candidate engaged with the relevant AI safety literature? Are the research directions that they’ve landed on consistent with a well developed theory of impact?

    • Conscientiousness

      • In interpretability, as in art, projects are never finished, only abandoned. How likely is it that the candidate will have enough conscientiousness to bring a project to a completed-enough state?

    In the past cohort I chose a diversity of candidates with varying strengths and I think this worked quite well. Some mentees were outstanding in particular dimensions, others were great all rounders.

    Mentorship looks like a 1 h weekly meeting by default with slack messages in between. Usually these meetings are just for updates about how the project is going, where I’ll provide some input and light steering if necessary and desired. If there are urgent bottlenecks I’m more than happy to meet in between the weekly interval or respond on slack in (almost always) less than 24h. In some cases, projects might be of a nature that they’ll work well as a collaboration with external researchers. In these cases, I’ll usually encourage collaborations. For instance, in the previous round of MATS a collaboration that was organized through the EleutherAI interpretability channel worked quite well, culminating in Cunningham et al. (2023). With regard to write ups, I’m usually happy to invest substantial time giving inputs or detailed feedback on things that will go public.

Jessica Rumbelow
CEO, Leap Labs

Jessica's team at Leap Laboratories is working on novel AI interpretability techniques, applying them to state-of-the-art models to enhance understanding and control across various domains.

  • Jessica leads research at Leap Laboratories. She has previously published work on saliency mapping, AI in histopathology, and more recently glitch tokens and prototype generation. Her research interests include black-box/model-agnostic interpretability, data-independent evaluation, and hypothesis generation. Leap Laboratories is a research-driven startup using AI interpretability for knowledge discovery in basic science.

  • You'll work in a small team with your fellow scholars to undertake one or more research projects in AI interpretability. This will involve creative problem solving and ideation; reading existing literature and preparing literature reviews; designing and implementing experiments in clean and efficient code; documenting and presenting findings; and writing up your results for internal distribution and possible publication. Specific projects are yet to be determined, but will be broadly focussed on developing novel interpretability algorithms and/or applying them to SotA models across different modalities.

  • Candidates are expected to have some programming experience with standard deep learning frameworks (you should be able to train a model from scratch on a non-trivial problem, and debug it effectively), and to be able to read and implement concepts from academic papers easily.

    Candidates should be excited about documenting their research and code thoroughly – we keep daily lab books – and happy to join regular standup and research meetings with the Leap team.

    See more via Leap Labs culture.

Stephen Casper
PhD student, MIT AAG

Stephen's research includes red-teaming AI systems to understand vulnerabilities, applying adversarial methods to interpret and improve robustness, and exploring the potential of RLHF.

  • Stephen (“Cas”) Casper is a Ph.D. student at MIT in the Algorithmic Alignment Group, advised by Dylan Hadfield-Menell. Most of his research involves interpreting, red-teaming, and auditing AI systems, but he sometimes works on other types of projects too. In the past, he has worked closely with over a dozen mentees on various alignment-related research projects.

  • Some example works from Cas:

    Specific research topics for research with MATS can be flexible depending on interest and fit. However, work will most likely involve one of three topics (or similar):

    • AI red-teaming as a game of carnival ring toss: In games of carnival ring toss, there are typically many pegs (e.g. hundreds). When you toss a ring, you usually miss, and even when you succeed, you usually only hit a few pegs. Red teaming seems to be like this. The number of possible inputs to modern AI systems is hyper-astronomically large, and past attempts to re-engineer even known problems with systems sometimes result in finding brand new ones. What if red-teaming, while useful, will always fail to be thorough? I’m interested in work that rounds up existing evidence, considers this hypothesis, and discusses what it may mean for technical safety and auditing policy.

    • Latent Adversarial Training (LAT): Some of my past (see above) and present work involves Latent Adversarial Training as a way of making models forget/unlearn harmful capabilities. It is a useful tool in the safety toolbox because it can be a partial solution for problems like jailbreaks, deception, trojans, and black swans. However, there is fairly little research to date on it, and I think there is a lot of low-hanging fruit. Some types of projects could focus on studying different parameterizations of attacks, conducting unrestricted attacks, or using latent-space attacks for interpretability.

    • How well does robustness against generalized adversaries predict robustness against practical threats: Most red-teaming and adversarial training with LLMs involves text-space attacks. Generalized attacks in the embedding or latent space of an LLM are more efficient and likely to better elicit many types of failure modes, but the existence of an embedding/latent-space vulnerability does not necessarily imply the existence of a corresponding input-space one. I am interested in work that aims to quantify how well robustness to generalized attacks predicts robustness to input-space attacks and few-shot finetuning attacks. This could be useful for guiding useful research on technical robustness methods and helping to figure out whether generalized attacks should be part of the evals/auditing toolbox.
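
    To make the latent-space attack idea concrete, here is a minimal sketch of optimizing a perturbation to an intermediate activation of a toy classifier; the model, layer split, budget, and target are all illustrative stand-ins, not Cas's actual LAT setup.

      import torch
      import torch.nn as nn

      torch.manual_seed(0)
      model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
      x = torch.randn(8, 16)
      bad_label = torch.zeros(8, dtype=torch.long)     # output the attack tries to force

      # Split the model and attack the latent after the first layer.
      encoder, head = model[:2], model[2:]
      latent = encoder(x).detach()

      delta = torch.zeros_like(latent, requires_grad=True)
      opt = torch.optim.Adam([delta], lr=1e-1)
      eps = 1.0                                        # perturbation budget

      for step in range(50):
          logits = head(latent + delta)
          loss = nn.functional.cross_entropy(logits, bad_label)  # minimize to force bad_label
          opt.zero_grad()
          loss.backward()
          opt.step()
          with torch.no_grad():
              delta.clamp_(-eps, eps)                  # keep the attack within budget

      # Latent adversarial training would then continue training the model so that it
      # still behaves safely on (latent + delta).
      print(head(latent + delta).argmax(dim=-1))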

  • Positive signs of good fit

    • Research skills – see below under “Additional Questions”.

    • Good paper-reading habits.

    Mentorship will look like:

    • Meeting 2-3x per week would be ideal.

    • Frequent check-ins about challenges. A good rule of thumb is to ask for help after getting stuck on something for 30 minutes.

    • A fair amount of independence with experimental design and implementation will be needed, but Cas can help with debugging once in a while. Clean code and good coordination will be key.

    • An expectation for any mentee will be to read and take notes on related literature daily.

    • A requirement for any mentee on day 1 will be to read these two posts, watch this video, and discuss them with Cas.

    • Cas is usually at MIT but sometimes visits other places. In-person meetings would be good but are definitely not necessary.

    Cas may interview/email some applicants prior to making final decisions for MATS mentees in this stream.

    Collaboration/mentorship can come in many forms! Even if Cas would not be a great fit as a MATS mentor, he may be able to advise or assist with certain projects on an informal, less-involved basis. To initiate this, email him with a project idea.

Nandi Schoots
PhD Student, Safe and Trusted AI Centre

Nandi focuses on the "Science of Deep Learning" to build a better understanding of deep neural networks. She seeks to operationalize concepts like functional simplicity and phase changes in neural networks in order to identify and mitigate potential risks in AI systems.

  • If you join this stream you will be co-supervised by Nandi Schoots and Dylan Cope.

    Nandi will finish her PhD in the Safe and Trusted AI Centre for Doctoral Training under the supervision of Peter McBurney (King’s College London) and Murray Shanahan (Imperial College London & DeepMind) this summer. Before starting her PhD, she briefly worked as an independent researcher with funding from Paul Christiano. Her thesis is in the field of “Science of Deep Learning”, which roughly corresponds to increasing our understanding of deep neural networks. Some of her past work includes Any Deep ReLU Network is Shallow and Dissecting Language Models: Machine Unlearning via Selective Pruning.

    Dylan Cope is also a final-year PhD student in the Safe and Trusted AI Centre for Doctoral Training, supervised by Peter McBurney (King’s College London). He has previously worked as a visiting scholar at CHAI, UC Berkeley. Dylan’s thesis is on the emergence of communication between reinforcement learning agents. Some of his past work includes Learning to Communicate with Strangers via Channel Randomisation Methods and Improving Activation Steering in Language Models with Mean-Centring.

  • Research Agenda

    A particularly dangerous development is the recent surge of interest in open-ended and continual agent learning, i.e. never stopping the training process, with the aim of creating a model that is ever improving. This is dangerous because such models keep developing during deployment: changes to the model affect the real world immediately, so dangerous mechanisms need to be caught before they are fully realized, or it could be too late. Early detection also beats late detection, because restarting training is expensive and removing dangerous behavior is difficult. We aim to tackle this issue by catching dangerous circuits before they emerge, and by adjusting training so that they are both less likely to emerge in the first place and easier to catch if they do. In short, this research agenda aims to prevent dangerous behavior before it emerges: we seek to advance the understanding of neural networks in order to steer model development and prevent existential risk, focusing on harnessing the science of deep learning to control models before and during training.

    A) Studying Functional Simplicity

    Neural networks that implement a more complex function than is needed to solve a dataset may have unnecessary and obscure behaviors out of distribution. We hypothesize that functional simplicity is correlated with safety-relevant properties, with more complex functions being either more or less likely to be dangerous.

    One way in which we can operationalise functional simplicity is as the extent to which we can distill a model into a smaller model.
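    As a concrete (and deliberately simplified) illustration of this operationalisation, the sketch below uses a standard knowledge-distillation loss with temperature-scaled soft targets; the teacher/student architectures, temperature `T`, and mixing weight `alpha` are illustrative assumptions, not part of the proposal.

    ```python
    # Minimal knowledge-distillation sketch (temperature-scaled soft targets), as one
    # possible way to measure how far a large "teacher" compresses into a small
    # "student". Architectures and hyperparameters are illustrative.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    teacher = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
    student = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 10))

    def distill_step(x, y, optimizer, T=2.0, alpha=0.5):
        with torch.no_grad():
            teacher_logits = teacher(x)
        student_logits = student(x)
        # Soft-target loss: match the teacher's temperature-scaled output distribution.
        kd = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard-target loss on the true labels.
        ce = F.cross_entropy(student_logits, y)
        loss = alpha * kd + (1 - alpha) * ce
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
    ```

    Functional simplicity could then be read off as, for example, the smallest student width at which the distilled student’s accuracy stays within some tolerance of the teacher’s.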

    We propose to investigate what architectures, hyper-parameter settings and datasets lead to functions with more or less functional simplicity.

    Your project could also focus on investigating whether certain safety-relevant properties such as power-seeking, situational awareness, or the existence of sleeper agents are correlated with functional simplicity. This would require applying mechanistic interpretability tools to uncover these safety-relevant properties.

    Route to value:

    • A smaller model is easier to interpret. If we can distill a bigger model into a smaller one, then this will make interpretability easier.

    • The simplicity of a network (given a fixed training dataset) may be related to the odds that the network implements dangerous behaviors such as power-seeking or situational awareness.

    Day to day:

    • Reading literature on model distillation.

    • Distilling models into smaller models, i.e. training small models.

    • Investigating trends in functional simplicity, or applying mechanistic interpretability tools.

    • Strategizing about what to study next and pivoting based on your findings.

    B) Increasing Modularity During Training

    Existing modularity metrics typically consider graph properties of the weights or correlations between neuron activations. We propose to extend work on correlations between activations to a metric based on information theory. Correlation captures only linear relationships and is sensitive to the scale and distribution of the data. In contrast, mutual information can capture any kind of dependency between variables, making it more suitable for detecting information flow within and between different modules.

    The metric we have in mind calculates how much information flows between a set of neurons in one layer and a set of neurons in another layer. This calculation is based on the conditional entropy of the activations in layer l+k given the activations in layer l.
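    One way to write this down (the exact estimator is deliberately left open) is as the mutual information between the two activation sets, which the conditional-entropy formulation captures via the standard identity

    $$ I(A_{\ell};\, A_{\ell+k}) \;=\; H(A_{\ell+k}) \;-\; H(A_{\ell+k} \mid A_{\ell}), $$

    where $A_{\ell}$ denotes the activations of the chosen neuron set in layer $\ell$: the lower the conditional entropy of $A_{\ell+k}$ given $A_{\ell}$, the more of the information in the later set flows from the earlier one.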

    The following process should increase the model’s modularity during training (a rough code sketch follows these steps):

    Step 1: Train a model for some training steps.

    Step 2: Use a clustering algorithm to find clusters of neurons in this model whose activations are correlated.

    Step 3: Train the model for some training steps while incorporating the information theoretic modularity metric into the loss function.

    Repeat.
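    The sketch below shows one minimal way this loop could look. A differentiable cross-cluster correlation penalty stands in for the information-theoretic term (in practice this would need a mutual-information or conditional-entropy estimator); the architecture, cluster count, and penalty weight are illustrative assumptions.

    ```python
    # Rough sketch of the alternating train / cluster / train-with-penalty loop.
    # The cross-cluster correlation penalty is only a proxy for the intended
    # information-theoretic modularity term.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from sklearn.cluster import SpectralClustering

    model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

    def cluster_neurons(hidden_acts, n_clusters=4):
        """Step 2: cluster hidden neurons whose activations are correlated."""
        corr = torch.corrcoef(hidden_acts.detach().T).abs().nan_to_num(0.0)
        labels = SpectralClustering(n_clusters=n_clusters,
                                    affinity="precomputed").fit_predict(corr.numpy())
        return torch.as_tensor(labels)

    def cross_cluster_penalty(hidden_acts, labels):
        """Proxy modularity term: penalise dependence between neurons in different clusters."""
        z = (hidden_acts - hidden_acts.mean(0)) / (hidden_acts.std(0) + 1e-6)
        corr = z.T @ z / z.shape[0]
        between = labels[:, None] != labels[None, :]
        return corr[between].abs().mean()

    def train_phase(loader, optimizer, labels=None, lam=0.1, steps=1000):
        """Steps 1 and 3: ordinary training, optionally with the modularity penalty."""
        for _, (x, y) in zip(range(steps), loader):
            hidden = model[1](model[0](x))        # post-ReLU activations of the hidden layer
            loss = F.cross_entropy(model[2](hidden), y)
            if labels is not None:
                loss = loss + lam * cross_cluster_penalty(hidden, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return hidden.detach()                    # last batch's activations, for re-clustering

    # Alternate: acts = train_phase(...); labels = cluster_neurons(acts);
    #            train_phase(..., labels=labels); and repeat.
    ```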

    Route to value:

    • Modular models are easier to interpret, because interpreting them is almost like investigating multiple smaller models.

    • In particular, making models modular by restricting information flow makes it less likely that there are obscure undetectable behaviors.

    Day to day:

    • Implementing the above steps 1, 2 and 3.

    • Training a variety of models using this training process.

    • Analyzing the modularity of the trained models.

    • Strategizing about what to study next and pivoting based on your findings.

    C) Phase Changes

    We propose to study how new skills arise during model training. A clearer understanding of how new skills emerge might allow us to intervene on a training run before a harmful skill is learned. A particularly dangerous training phenomenon is a sudden change in model functionality, exhibited as a sudden drop in test loss on a dataset that exemplifies a specific behavior; this is referred to as a phase change. You would investigate how prevalent such sudden changes are, as well as how they can be identified before or while they occur.

    The SI-Score dataset may be a good starting point for this investigation. This is a synthetic dataset that features images with highly detailed labeling, which allows for tracking the loss on a large number of qualitatively different subsets.

    You would study whether these sudden changes in test loss are a consequence of: 1) the emergence of a new circuit; or 2) a (qualitative) change in existing circuits.
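    As a minimal illustration of what tracking such granular losses could look like in practice, the sketch below evaluates the loss on each labeled subset at regular intervals and flags sudden drops; the subset loaders, window length, and drop threshold are illustrative assumptions rather than a proposed detection rule.

    ```python
    # Minimal sketch of tracking per-subset test losses during training and flagging
    # sudden drops as candidate phase changes.
    import torch
    import torch.nn.functional as F
    from collections import defaultdict

    history = defaultdict(list)   # subset name -> loss at each evaluation step

    @torch.no_grad()
    def eval_subsets(model, subset_loaders):
        """Mean loss on each qualitatively different subset (e.g. SI-Score label slices)."""
        losses = {}
        for name, loader in subset_loaders.items():
            total, count = 0.0, 0
            for x, y in loader:
                total += F.cross_entropy(model(x), y, reduction="sum").item()
                count += y.numel()
            losses[name] = total / count
        return losses

    def flag_phase_changes(losses, window=10, drop_ratio=0.5):
        """Flag subsets whose loss suddenly falls well below its recent running average."""
        flagged = []
        for name, loss in losses.items():
            history[name].append(loss)
            recent = history[name][-window - 1:-1]   # previous evaluations, excluding this one
            if recent and loss < drop_ratio * (sum(recent) / len(recent)):
                flagged.append(name)
        return flagged

    # Inside the training loop, every eval_interval steps:
    #   flagged = flag_phase_changes(eval_subsets(model, subset_loaders))
    #   if flagged: pause and inspect the circuits implicated in those subsets.
    ```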

    Route to value:

    • Suppose we are able to investigate trained models via, e.g., mechanistic interpretability and model evaluations, but that doing so is quite costly. Then being able to detect that a significant model change has just happened would help us to:

    • Identify a harmful model and discard it before wasting resources on fully training the model.

    • Identify a model with the beginnings of a harmful circuit, rather than a fully grown harmful model, which may pose a risk even during evaluation.

    Day to day:

    • Implement very granular loss functions.

    • Train models while keeping track of the granular loss functions.

    • Investigate changes in loss. Hypothesize what causes the observations.

    • Strategize what to study next and pivot based on your findings.

    D) Other

    We are open to mentoring other projects as well. You can propose your own project in the space of “science of deep learning”/inductive biases, interpretability or steganography.

  • Ideal candidates should have:

    • Solid software engineering experience

    • Some knowledge of deep learning

    • Good communication skills

    Mentorship looks like:

    • 1 hour weekly meetings with each scholar individually (typically with both Nandi and Dylan joining the meeting)

    • Weekly team meetings discussing plans and logistics (with Nandi)

    • Slack response time typically ≤ 48 hours (Nandi or Dylan)

    • Providing feedback on write-up drafts

    What we like about mentoring is:

    • Brainstorming with scholars

    • Striking a balance between guiding scholars through doing good research and helping them develop their own unique research taste